Abstract

Weak topic correlation across document collections with different numbers of topics in individual collections presents challenges for existing cross-collection topic models. This paper introduces two probabilistic topic models, Correlated LDA (C-LDA) and Correlated HDP (C-HDP). These address problems that can arise when analyzing large, asymmetric, and potentially weakly-related collections. Topic correlations in weakly-related collections typically lie in the tail of the topic distribution, where they would be overlooked by models unable to fit large numbers of topics. To efficiently model this long tail for large-scale analysis, our models implement a parallel sampling algorithm based on the Metropolis-Hastings and alias methods (Yuan et al., 2015).

The models are first evaluated on synthetic data, generated to simulate various collection-level asymmetries. We then present a case study of modeling over 300k documents in collections of sciences and humanities research from JSTOR.

1 Introduction 

Comparing large text collections is a critical task for the curation and analysis of human cultural history. Achievements of research and scholarship are most accessible through textual artifacts, which are increasingly available in digital archives. Text-based research, often undertaken by humanists, historians, lexicographers, and corpus linguists, explores patterns of words in documents across time-periods and distinct collections of text. Here, we introduce two new topic models designed to compare large collections, Correlated LDA (C-LDA) and Correlated HDP (C-HDP), which are sensitive to document-topic asymmetry (where collections have different topic distributions) and topic-word asymmetry (where a single topic has different word distributions in each collection). These models seek to address terminological questions, such as how a topic on physics is articulated distinctively in scientific compared to humanistic research. Accommodating potential collection-level asymmetries is particularly important when researchers seek to analyze collections with little prior knowledge about shared or collection-specific topic structure. Our models extend existing cross-collection approaches to accommodate these asymmetries and implement an efficient parallel sampling algorithm enabling users to examine the long tail of topics in particularly large collections.


Using topic models for comparative text mining was introduced by Zhai et al. (2004), who developed the ccMix model, which extended pLSA (Hofmann, 1999). Later work by Paul and Girju (2009) developed ccLDA, which adopted the hierarchical Bayes framework of Latent Dirichlet Allocation or LDA (Blei et al., 2003). These models account for topic-word asymmetry by assuming variation in the vocabularies of topics is due to collection-level differences. Nevertheless, they require the same topics to be present in each collection. These models are useful for comparing collections under specific assumptions, but cannot accommodate collection-topic asymmetry (which arises in collections that do not share every topic or that have different numbers of topics). In situations where collections do not share all topics, the results often include junk, mixed, or sparse topics, making them difficult to interpret (Paul and Girju, 2009). Such asymmetries make it difficult to use models like ccLDA and ccMix when little is known about collections in advance. This motivates our efforts to model variation in the long tail of topic distributions, where correlations are more likely to appear when collections are weakly related.


C-LDA and C-HDP extend ccLDA (Paul and Girju, 2009) to accommodate collection-topic level asymmetries, particularly by allowing non-common topics to appear in each collection. This added flexibility allows our models to discover topic correlations across arbitrary collections with different numbers of topics, even when the number of common topics is small or unknown. To demonstrate the effectiveness of our models, we evaluate them on synthetic data and show that they outperform related models such as ccLDA and differential topic models (Chen et al., 2014). We then fit C-LDA to two large collections of humanities and sciences documents from JSTOR. Such historical analyses of text would be intractable without an efficient sampler. An optimized sampler is required in such situations because common topics in weakly-correlated collections are usually found in the tail of the document-topic distribution of a sufficiently large set of topics. To make this feasible on large datasets such as JSTOR, we employ a parallelized Metropolis-Hastings (Kronmal and Peterson Jr, 1979) and alias-table sampling framework, adapted from LightLDA (Yuan et al., 2015). These optimizations, which achieve $O(1)$ amortized sampling time per token, allow our models to be fit to large corpora with up to thousands of topics in a matter of hours, an order of magnitude speed-up from ccLDA.


After reviewing work related to topic modeling across collections, Section 3 describes C-LDA and C-HDP, and then details their technical relationship to existing models. Section 5 introduces the synthetic data and part of the JSTOR corpus used in our evaluations. We then compare our models' performances to other models in terms of held-out perplexity and a measure of distinguishability. The final results section exemplifies the use of C-LDA in a qualitative analysis of humanities and sciences research. We conclude with a brief discussion of the strengths of C-LDA and C-HDP, and outline directions for future work and applications.


2 Related Work 


Our models seek to enable users to compare large collections that may only be weakly correlated and that may contain different numbers of topics. While topic models could be fit to separate collections to make post-hoc comparisons (Denny et al., 2014; Yang et al., 2011), our goal is to account for both document-topic asymmetry and topic-word asymmetry “in-model”. In short, we seek to model the correlation between arbitrary collections. Prioritizing in-model solutions for document-topic asymmetry has been explored elsewhere, such as in hierarchical Dirichlet processes (HDP), which use an additional level to account for collection variations in document-topic distributions (Teh et al., 2006).


One method designed to model topic-word asymmetry is ccMix (Zhai et al., 2004), which models the generative probability of a word in topic $z$ from collection $c$ as a mixture of shared and collection-specific distributions:

$$p(w) = \lambda_c\, p(w \mid \theta_z) + (1 - \lambda_c)\, p(w \mid \theta_{z,c})$$

where $\theta_{z,c}$ is collection-specific and $\lambda_c$ controls the mixing between shared and collection-specific topics. ccLDA extends ccMix to the LDA framework and adds a beta prior over $\lambda_c$ that reduces sensitivity to input parameters (Paul and Girju, 2009). Another approach, differential topic models (Chen et al., 2014), is based on hierarchical Bayesian models over topic-word distributions. This method uses the transformed Pitman-Yor process (TPYP) to model topic-word distributions in each collection, with shared common base measures. As Paul and Girju (2009) note, ccLDA cannot accommodate a topic if it is not common across collections, an assumption made by ccMix, ccLDA and the TPYP. In a situation where a topic is found in only one collection, it would either dominate the shared topic portion (resulting in a noisy, collection-specific portion), or it would appear as a mixed topic, revealing two sets of unrelated words (Newman et al., 2010b). C-LDA ameliorates this situation by allowing the number of common and non-common topics to be specified separately and by efficiently sampling the tail of the document-topic distribution, allowing users to examine less prominent regions of the topic space. C-HDP also grants collections document-topic independence, using a hierarchical structure to model the differences between collections.
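As a small illustration of the ccMix mixture above, the following sketch evaluates the word probability for one topic. The function name, the dictionary representation of the distributions, and the toy numbers are assumptions made for this example only; they do not come from any of the cited implementations.

```python
def ccmix_word_prob(word, shared, specific, lam):
    """Mixture p(w) = lam * p(w | theta_z) + (1 - lam) * p(w | theta_{z,c}).

    `shared` and `specific` are hypothetical dicts mapping words to
    probabilities; `lam` plays the role of lambda_c.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * shared.get(word, 0.0) + (1.0 - lam) * specific.get(word, 0.0)


# Toy distributions for a single topic (illustrative values only).
shared = {"particle": 0.6, "energy": 0.4}
science = {"electron": 0.7, "particle": 0.3}
p_particle = ccmix_word_prob("particle", shared, science, lam=0.5)
```

Because both components are proper distributions and the weights sum to one, the mixture is itself a distribution over the union of the two vocabularies.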

Due to increased demand for scalable topic model implementations, there has been a proliferation of optimized methods for efficient inference, such as SparseLDA (Yao et al., 2009) and AliasLDA (Li et al., 2014). AliasLDA achieves $O(K_d)$ complexity by using the Metropolis-Hastings-Walker algorithm and an alias table to sample topic-word distributions in $O(1)$ time. Although this strategy introduces temporal staleness in the updates of sufficient statistics, the lag is overcome by running more iterations, and the sampler still converges significantly faster. A similar technique by Yuan et al. (2015), LightLDA, employs cycle-based Metropolis-Hastings mixing with alias tables for both document-topic and topic-word distributions. Despite introducing lag in the sufficient statistics, this method achieves $O(1)$ amortized sampling complexity and results in even faster convergence than AliasLDA. In addition to being fully parallelized, C-LDA adopts this sampling framework to make comparing large collections more tractable for large numbers of topics. Our models' efficient sampling methods allow users to fit large numbers of topics to big datasets where variation might not be observed in sub-sampled datasets or models with fewer topics.


3 The Models 

3.1 Correlated LDA 

In ccLDA (and ccMix), each topic has shared and collection-specific components for each collection. C-LDA extends ccLDA to make it more robust with respect to topic asymmetries between collections (Figure 1a). The crucial extension is that by allowing each collection to define a set of non-common topics in addition to common topics, the model removes an assumption imposed by ccLDA and other inter-collection models, namely that collections have the same number of topics. As a result, C-LDA is suitable for collections without a large proportion of common topics, and can also reduce noise (discussed below). To achieve this, C-LDA assumes document $d$ in collection $c$ has a multinomial document-topic distribution $\theta$ with an asymmetric Dirichlet prior for $K_c$ topics, where the first $K_0$ are common across collections. It is also possible to introduce a tree structure into the model that uses a binomial distribution to decide whether a word was drawn from common or non-common topics. This yields collection-specific background topics by using a binomial distribution instead of a multinomial. However, we prefer the simpler, non-tree version because background topics are unnecessary when using an asymmetric $\alpha$ prior (Wallach et al., 2009a).

The generative process for C-LDA is as follows: 

1. Sample a distribution $\phi_k$ (shared component) from $\mathrm{Dir}(\beta)$ and a distribution $\sigma_k$ from $\mathrm{Beta}(\delta_1, \delta_2)$ for each common topic $k \in \{1, \ldots, K_0\}$;

2. For each collection $c$, sample a distribution $\phi^c_k$ (collection-specific component) from $\mathrm{Dir}(\beta)$ for each common topic $k \in \{1, \ldots, K_0\}$ and non-common topic $k \in \{K_0 + 1, \ldots, K_c\}$;

3. For each document $d$ in $c$, sample a distribution $\theta$ from $\mathrm{Dir}(\alpha_c)$;

4. For each word $w_i$ in $d$:

(a) Sample a topic $z_i \in \{1, \ldots, K_c\}$ from $\mathrm{Multi}(\theta)$;

(b) If $z_i \le K_0$, sample $y_i$ from $\mathrm{Binomial}(\sigma_{z_i})$;

(c) Sample $w_i$ from $\mathrm{Multi}(\phi^{\zeta}_{z_i})$, where $\zeta = \text{null}$ if $z_i \le K_0$ and $y_i = 0$, and $\zeta = c$ otherwise.
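The generative process above can be sketched in a few lines of standard-library Python. This is an illustrative reading of the four steps, not the authors' implementation: the function names are hypothetical, topics are 0-indexed (so topics 0 to K0-1 are the common ones), the asymmetric prior $1/(i + \sqrt{K_c})$ is borrowed from the synthetic-data experiments, and it is an assumption that $\sigma$ is the probability of drawing $y = 1$.

```python
import random


def dirichlet(alphas, rng):
    """Draw from Dir(alphas) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    if total == 0.0:  # guard against underflow for very small alphas
        return [1.0 / len(alphas)] * len(alphas)
    return [x / total for x in draws]


def categorical(probs, rng):
    """Draw one index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_clda(K0, Kc, docs_per_coll, doc_len, V, beta, delta, seed=0):
    """Return a list of (collection, [(z, y, word), ...]) documents."""
    rng = random.Random(seed)
    # Step 1: shared components and mixing variables for common topics.
    phi_shared = [dirichlet([beta] * V, rng) for _ in range(K0)]
    sigma = [rng.betavariate(delta[0], delta[1]) for _ in range(K0)]
    corpus = []
    for c, K in enumerate(Kc):
        # Step 2: collection-specific components for every topic of c.
        phi_spec = [dirichlet([beta] * V, rng) for _ in range(K)]
        alpha = [1.0 / (i + K ** 0.5) for i in range(K)]  # assumed prior
        for _ in range(docs_per_coll):
            theta = dirichlet(alpha, rng)  # Step 3
            tokens = []
            for _ in range(doc_len):  # Step 4
                z = categorical(theta, rng)  # 4(a)
                y = 1
                if z < K0:  # 4(b): only common topics draw y
                    y = 1 if rng.random() < sigma[z] else 0
                # 4(c): shared component only for common topics with y = 0.
                dist = phi_shared[z] if (z < K0 and y == 0) else phi_spec[z]
                tokens.append((z, y, categorical(dist, rng)))
            corpus.append((c, tokens))
    return corpus


corpus = generate_clda(K0=2, Kc=[2, 4], docs_per_coll=3, doc_len=10,
                       V=20, beta=0.1, delta=(1.0, 1.0), seed=1)
```

In the usage line, the first collection has no non-common topics (its topic count equals K0), which mirrors the recommendation that the collection with the fewest topics should have zero non-common topics.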

Note that to capture common topics, $K_0$ should be set such that $\exists c$ where $K_c = K_0$. Otherwise, words sampled as a non-common topic will not have information about non-common topics in other collections. A “common-topic word” is then found among non-common topics in all collections (a local minimum) and it will take a long time to stabilize as a common topic. To avoid this, when determining the number of topics for sampling, the number of non-common topics for the collection with the smallest number of total topics should be zero. After inference, to distinguish common and non-common topics in this collection, we model $\sigma$ independently by assuming collections have the same mixing ratio for common topics. With this reasonable assumption and an asymmetric $\alpha$, common topics become sparse enough that some $\sigma$ distributions reduce nearly to 0, distinguishing them as non-common topics. Although this may seem counterintuitive, it does not negatively affect results.

Figure 1: Graphical models of C-LDA (a; left) and C-HDP (b; right).

Three kinds of collection-level imbalance can confound inter-collection topic models: 1) in the numbers of topics between collections, 2) in the numbers of documents between collections, and 3) in the document-topic distributions. Each of these can cause topics in different collections to have significantly different numbers of words assigned to the same topic. In this way, a topic can be dominated by the collection comprising most of its words. C-LDA addresses imbalances in the document-topic distributions between collections by estimating $\alpha$. For imbalance in the number of topics and documents, C-LDA mimics document over-sampling in the Gibbs sampler using a different unit-value in the word count table for each collection. Specifically, a unit $\eta_c$ is chosen for each collection such that the average equivalent number of assigned words per topic ($\eta_c \sum_{d \in c} N_d / K_c$, where $N_d$ is the length of document $d$) is equal. This process improves both the topic quality (in terms of collection balance) and the resulting held-out perplexity of the model.
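One way to read the balancing rule is sketched below, under two assumptions that are not stated in the text: the balanced quantity is taken to be $\eta_c$ times the total word count of the collection divided by its number of topics, and the units are normalised so that the collection with the most words per topic keeps a unit of 1. The function name is hypothetical.

```python
def balance_units(doc_lengths_by_coll, topics_by_coll):
    """Return one unit eta_c per collection so that
    eta_c * (total words in c) / K_c is equal across collections.

    Normalisation choice (an assumption): the collection with the most
    words per topic gets eta = 1, the others are scaled up.
    """
    per_topic = [sum(lengths) / k
                 for lengths, k in zip(doc_lengths_by_coll, topics_by_coll)]
    target = max(per_topic)
    return [target / p for p in per_topic]


# Collection 0: 1,000 documents of 50 words and 40 topics.
# Collection 1: 250 documents of 50 words and 20 topics.
eta = balance_units([[50] * 1000, [50] * 250], [40, 20])
```

With these toy sizes the first collection has 1,250 words per topic and the second 625, so each word in the second collection would count double in the count table.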


3.2 Correlated HDP 


To alleviate C-LDA's requirement that $\exists c$ such that $K_c = K_0$, we introduce a variant of the model, the correlated hierarchical Dirichlet process (C-HDP), that uses a 3-level hierarchical Dirichlet process (Teh et al., 2006). The generative process for C-HDP is the same as C-LDA shown above, except that here we assume a word's topic, $z$, is generated by a hierarchical Dirichlet process:

$$G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H)$$
$$G_c \mid \alpha_0, G_0 \sim \mathrm{DP}(\alpha_0, G_0)$$
$$G_d \mid \alpha_1, G_c \sim \mathrm{DP}(\alpha_1, G_c)$$
$$z \mid G_d \sim G_d$$

where $G_0$ is a base measure for each collection-level Dirichlet process, and $G_c$ are base measures of document-level Dirichlet processes in each collection (Figure 1b). Thus, documents from the same collection will have similar topic distributions compared to those from other collections, and collections are allowed to have distinct sets of topics due to the use of HDP.

4 Inference 

4.1 Posterior Inference in C-LDA 

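C-LDA can be trained using collapsed Gibbs sampling with $\phi$, $\theta$, and $\sigma$ integrated out. Given the status assignments of other words, the sampling distribution for word $w_i$ is given by:

$$p(y_i, z_i \mid w, y_{-i}, z_{-i}, \delta, \alpha, \beta) \propto \underbrace{\bigl(N(d, z_i) + \alpha_{c, z_i}\bigr)}_{q_d} \times \underbrace{\begin{cases} \dfrac{N(y_i, z_i) + \delta_{y_i}}{N(z_i) + \sum_k \delta_k} \cdot \dfrac{N(w_i, y_i, z_i, \zeta) + \beta}{N(y_i, z_i, \zeta) + V\beta} & z_i \le K_0 \\[2ex] \dfrac{N(w_i, z_i, c) + \beta}{N(z_i, c) + V\beta} & z_i > K_0 \end{cases}}_{q_w} \qquad (1)$$

where $\zeta = \text{null}$ if $y_i = 0$ and $\zeta = c$ if $y_i = 1$, and $N(\cdots)$ is the number of status assignments for $(\cdots)$, not including $w_i$.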

Inference in C-LDA employs two optimizations: a parallelized sampler and an efficient sampling algorithm (Algorithm 1). We use the parallel schema in (Smola and Narayanamurthy, 2010; Lu et al., 2013), which applies atomic updates to the sufficient statistics to avoid race conditions. The key idea behind the optimized sampler is the combination of alias tables and the Metropolis-Hastings method (MH), adapted from (Yuan et al., 2015; Li et al., 2014). Metropolis-Hastings is a Markov chain Monte Carlo method that uses a proposal distribution to approximate the true distribution when exact sampling is difficult. In a complementary way, Walker's alias method (2004) allows one to efficiently sample from a discrete distribution by using an alias table, constructed in $O(K)$ time, from which we can sample in $O(1)$. Thus, reusing the sampler $K$ times as the proposal distribution for Metropolis-Hastings yields $O(1)$ amortized sampling time per token.

Algorithm 1 Sampling in C-LDA
repeat
    for all documents {d} in parallel do
        for words {w} in d do
            z ← CycleMH(p, q_w, q_d, z)
            sample y given z
            Atomic update sufficient statistics
    Estimate α
until convergence

procedure CycleMH(p, q_w, q_d, z)
    for i = 1 to N do
        if i is even then
            proposal q ← q_w
        else
            proposal q ← q_d
        sample z' ~ AliasTable(q)
        if RandUnif(1) < min(1, p(z') q(z) / (p(z) q(z'))) then
            z ← z'
    return z
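The combination of alias tables and cycled Metropolis-Hastings proposals can be sketched as follows. This is a minimal single-threaded illustration, not the authors' sampler: the alias table is built with Vose's variant of Walker's construction (an assumption about the construction used), the function names are hypothetical, and `p`, `q_w`, and `q_d` are plain lists of per-topic weights rather than the count-based terms of Eq. 1.

```python
import random


def build_alias(probs):
    """Build an alias table in O(K) from normalised probabilities."""
    k = len(probs)
    scaled = [p * k for p in probs]
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    prob, alias = [1.0] * k, list(range(k))
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]  # l donates mass to fill slot s
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias


def alias_draw(table, rng):
    """Draw one index in O(1): pick a slot, then the slot or its alias."""
    prob, alias = table
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


def cycle_mh(p, q_w, q_d, tables, z, steps, rng):
    """Alternate the word and document proposals for `steps` MH steps.

    p is the (unnormalised, strictly positive) target; tables holds the
    alias tables for q_w and q_d, in that order.
    """
    for i in range(1, steps + 1):
        q, table = (q_w, tables[0]) if i % 2 == 0 else (q_d, tables[1])
        z_new = alias_draw(table, rng)
        accept = min(1.0, (p[z_new] * q[z]) / (p[z] * q[z_new]))
        if rng.random() < accept:
            z = z_new
    return z


rng = random.Random(0)
p = [5.0, 3.0, 2.0]                    # toy target weights
q_w, q_d = [0.2, 0.3, 0.5], [1 / 3] * 3  # toy proposals
tables = (build_alias(q_w), build_alias(q_d))
z = cycle_mh(p, q_w, q_d, tables, z=0, steps=2, rng=rng)
```

Because each proposal is drawn in constant time and the tables are rebuilt only occasionally, the per-token cost does not grow with the number of topics; the price is that the tables may lag behind the current counts.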

Notice that in Eq. 1 the sampling distribution is the product of a single document-dependent term $q_d$ and a single word-dependent term $q_w$. After burn-in, both terms will be sparse (without the smoothing factor). It is therefore reasonable to use $q_d$ and $q_w$ as cycle proposals (Yuan et al., 2015), alternating them in each Metropolis-Hastings step. Our experiments show that the primary drawback of this method, stale sufficient statistics, does not empirically affect convergence. Our implementation uses proposal distributions $q_w$ and $q_d$ with $y$ marginalized out, which reduces the size of the alias tables and yields even faster convergence. After the Metropolis-Hastings steps, $y$ is sampled given the updated $z$.

Lastly, the use of an asymmetric $\alpha$ allows C-LDA to discover correlations between less dominant topics across collections (Wallach et al., 2009a). We use Minka's fixed-point method, with a gamma hyper-prior, to optimize $\alpha_c$ for each collection separately (Wallach, 2008). All other hyperparameters were fixed during inference.


4.2 Posterior Inference in C-HDP 

C-HDP uses the block sampling algorithm described in (Chen et al., 2011), which is based on the Chinese restaurant process metaphor. Here, rather than tracking all assignments (as do the samplers given in Teh et al. (2006)), table indicators are used to track only the start of new tables, which allows us to adopt the same sampling framework as C-LDA. In the Chinese restaurant process, each Dirichlet process in the hierarchical structure is represented as a restaurant with an infinite number of tables, each serving the same dish. New customers can either join a table with existing customers, or start a new table. If a new table is chosen, a proxy customer will be sent to the parent restaurant to determine the dish served to that table.
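The seating rule of a single restaurant can be illustrated with a short simulation. This sketch covers only the plain Chinese restaurant process for one Dirichlet process (join a table with probability proportional to its size, open a new one with probability proportional to the concentration parameter); it does not implement the hierarchy, the proxy customers, or the table-indicator block sampler of Chen et al. (2011). The function names are hypothetical.

```python
import random


def crp_seat(table_sizes, concentration, rng):
    """Return the index of the chosen table.

    An index equal to len(table_sizes) means a new table is opened.
    """
    total = sum(table_sizes) + concentration
    r = rng.random() * total
    for idx, size in enumerate(table_sizes):
        r -= size
        if r < 0.0:
            return idx
    return len(table_sizes)


def simulate_crp(n_customers, concentration, seed=0):
    """Seat customers one at a time and return the final table sizes."""
    rng = random.Random(seed)
    tables = []
    for _ in range(n_customers):
        idx = crp_seat(tables, concentration, rng)
        if idx == len(tables):
            tables.append(1)  # customer opens a new table
        else:
            tables[idx] += 1  # customer joins an existing table
    return tables


sizes = simulate_crp(n_customers=100, concentration=1.0, seed=42)
```

In the hierarchical setting described above, each newly opened table would additionally send a proxy customer to the parent restaurant, which is seated there by the same rule.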

In the block sampler, indicators are used to denote a customer creating a table (or tables) up to level $u$ (0 as the root, 1 for collection level, and 2 for the document level), and $u = 3$ indicates no table has been created. For example, when a customer creates a table at the collection level, and the proxy customer in the collection level creates a table at the root level, $u$ is 0. With this metaphor, let $n_{lz}$ be the number of customers (including their proxies) served dish $z$ at restaurant $l$, and let $t_{lz}$ be the number of tables serving dish $z$ at restaurant $l$ ($l = 0$ for root, $l = c$ for collection level or $l = d$ for document level), with $N_0 = \sum_z n_{0z}$ and $N_c = \sum_z n_{cz}$. By the chain rule, the conditional probability of the state assignments for $w_i$, given all others, is

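$$p(y_i, z_i, u_i \mid w, y_{-i}, \ldots) \propto \frac{N(y, z) + \delta_y}{N(z) + \sum_k \delta_k} \cdot \frac{N(w, y, z, \zeta) + \beta}{N(y, z, \zeta) + V\beta} \times \begin{cases} \dfrac{\gamma\,\alpha_0\,\alpha_1}{(\gamma + N_0)(\alpha_0 + N_c)} \dfrac{S^{n_{0z}+1}_{t_{0z}+1}}{S^{n_{0z}}_{t_{0z}}} \dfrac{S^{n_{cz}+1}_{t_{cz}+1}}{S^{n_{cz}}_{t_{cz}}} \dfrac{S^{n_{dz}+1}_{t_{dz}+1}}{S^{n_{dz}}_{t_{dz}}} \dfrac{(t_{0z}+1)(t_{cz}+1)(t_{dz}+1)}{(n_{0z}+1)(n_{cz}+1)(n_{dz}+1)} & u = 0 \\[2ex] \dfrac{\alpha_0\,\alpha_1}{(\gamma + N_0)(\alpha_0 + N_c)} \dfrac{S^{n_{0z}+1}_{t_{0z}}}{S^{n_{0z}}_{t_{0z}}} \dfrac{S^{n_{cz}+1}_{t_{cz}+1}}{S^{n_{cz}}_{t_{cz}}} \dfrac{S^{n_{dz}+1}_{t_{dz}+1}}{S^{n_{dz}}_{t_{dz}}} \dfrac{(n_{0z}-t_{0z}+1)(t_{cz}+1)(t_{dz}+1)}{(n_{0z}+1)(n_{cz}+1)(n_{dz}+1)} & u = 1 \\[2ex] \dfrac{\alpha_1}{\alpha_0 + N_c} \dfrac{S^{n_{cz}+1}_{t_{cz}}}{S^{n_{cz}}_{t_{cz}}} \dfrac{S^{n_{dz}+1}_{t_{dz}+1}}{S^{n_{dz}}_{t_{dz}}} \dfrac{(n_{cz}-t_{cz}+1)(t_{dz}+1)}{(n_{cz}+1)(n_{dz}+1)} & u = 2 \\[2ex] \dfrac{S^{n_{dz}+1}_{t_{dz}}}{S^{n_{dz}}_{t_{dz}}} \dfrac{n_{dz}-t_{dz}+1}{n_{dz}+1} & u = 3 \end{cases}$$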


Here, $S^n_t$ is the Stirling number, the ratios of which can be efficiently precomputed (Buntine and Hutter, 2010). The concentration parameters $\gamma$, $\alpha_0$, and $\alpha_1$ can be sampled using the auxiliary variable method (Teh et al., 2006).

Note that because the conditional probability has the same separability as in C-LDA (to give terms $q_w$ and $q_d$), the same sampling framework can be used with two alterations: 1) when a new topic is created or removed at the root, collection, or document level, the related alias tables must be reset, which makes the sampling slightly slower than $O(1)$, and 2) while the document alias table samples $z$ and $u$ simultaneously, after sampling $z$ from the word alias table $u$ must be sampled using $t_{lz}/n_{lz}$ (Chen et al., 2011). Parallelizing C-HDP requires an additional empirical method of merging new topics between threads (Newman et al., 2009), which is outside of the scope of this work. Our implementations of both models, C-LDA and C-HDP, are open-sourced online.¹

Figure 2: Held-out perplexity of C-LDA, C-HDP, ccLDA and TPYP fit to synthetic data, where $K_1 = K_2 = K$ (a; left) and data with an asymmetric number of topics (b; right).


5 Experiments 
5.1 Model Comparison


We use perplexity on held-out documents to evaluate the performance of C-LDA and C-HDP. In all experiments, the gamma prior for $\alpha$ in C-LDA was set to (1, 1), and (5, 0.1), (5, 0.1), (0.1, 0.1) for $\gamma$, $\alpha_0$, $\alpha_1$ respectively in C-HDP. In the hold-out procedure, 20% of documents were randomly selected as test data. LDA, C-LDA and ccLDA were run for 1,000 iterations and C-HDP and the TII-variant of TPYP for 1,500 iterations (unless otherwise noted), all of which converged to a state where change in perplexity was less than 1% for ten consecutive iterations.

Perplexity was calculated from the marginal likelihood of a held-out document $p(w \mid \Phi, \alpha)$, estimated using the “left-to-right” method (Wallach et al., 2009b). Because it is difficult to validate real-world data that exhibits different kinds of asymmetry, we use synthetic data generated specifically for our evaluation tasks (AlSumait et al., 2009; Wallach et al., 2009b; Kucukelbir and Blei, 2014).
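For reference, held-out perplexity is conventionally the exponentiated negative log-likelihood per token. The sketch below assumes that convention and that per-document log-likelihoods (for example from a left-to-right estimator) are already available; the function name is hypothetical, and the estimator itself is not reproduced here.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-(sum of log p(w_d)) / (total number of held-out tokens))."""
    total_tokens = sum(doc_lengths)
    if total_tokens == 0:
        raise ValueError("no held-out tokens")
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)


# Sanity check: a model that is uniform over 8 words has perplexity 8.
lengths = [3, 5]
log_liks = [n * math.log(1.0 / 8.0) for n in lengths]
uniform_ppl = perplexity(log_liks, lengths)
```

Lower values indicate that the model assigns higher probability to the held-out text.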


5.1.1 Topic Correlation

C-LDA is unique in the amount of freedom it allows when setting the number of topics for collections. To assess the models' performances with various topic correlations in a fair setting, we generated two collections of synthetic data by following the generative process (varying the number of topics) and measured the models' perplexities against the ground truth parameters. In each experiment, two collections were generated, each with 1,000 documents containing 50 words each, over a vocabulary of 3,000. $\beta$ and $\delta$ were fixed at 0.01 and 1.0 respectively, and $\alpha$ was asymmetrically defined as $1/(i + \sqrt{K_c})$ for $i \in [0, K_c - 1]$.

¹ https://github.com/iceboal/correlated-lda

Completely shared topics The assumptions imposed by ccLDA and TPYP effectively make them a special case of our model where $K_0 = K_1 = K_2 = \ldots$. To compare results, data was generated such that all numbers of topics were equal to $K \in [10, 90]$. Additionally, all models were configured to use this ground truth parameter when training. Not surprisingly, ccLDA, C-LDA, and C-HDP have almost the same perplexity with respect to $K$ because their structure is the same when all topics are shared (Figure 2a).

Asymmetric numbers of topics To explore the effect of asymmetry in the number of topics, data was generated such that one collection had $K_1 \in [20, 60]$ topics while a second had a fixed $K_2 = 40$ topics. The number of shared topics was set to $K_0 = 20$. The parameters for C-LDA and C-HDP (initial values) were set to ground truths, and, to retain a fair comparison, versions of ccLDA and TPYP were fit with both $K = K_1$ and $K = K_2$.

We find that ccLDA performs nearly as well as C-LDA and C-HDP when there is more symmetry between collections, namely when $K_1 \approx K_2$ (Figure 2b). TPYP, on the other hand, performs well with more topics ($2 \times \max(K_1, K_2)$ where the ground truth is $K_1$ and $K_2$). In contrast, C-LDA and C-HDP perform more consistently than other models across varying degrees of asymmetry.

Partially-shared topics When collections have the same number of topics, C-LDA, C-HDP and ccLDA exhibit adequate flexibility, resulting in similar perplexities. When collections have increasingly few common topics, however, common and non-common topics from ccLDA are considerably less distinguishable than those from C-LDA. To evaluate the models' abilities in such situations, data was generated for two collections having $K_1 = K_2 = 50$ topics, but with the shared number of topics $K_0 \in [5, 45]$. We also set $\delta^{(0)} = \delta^{(1)} = 5$, and for comparison to ccLDA we used $K = 50$.

To measure this distinguishability, we examine the inferred $\sigma$. Recall that $\sigma$ indicates what percentage of a common topic is shared. When a topic is actually non-common, the value of $\sigma$ should be small. We sort $\sigma_k$ for $k \in [1, K]$ in descending order and use

$$\hat{\sigma}_{\text{common}} = \frac{1}{K_0} \sum_{k=1}^{K_0} \sigma_k, \qquad \hat{\sigma}_{\text{non-common}} = \frac{1}{K - K_0} \sum_{k=K_0+1}^{K} \sigma_k \qquad (2)$$

as measures of how well common and non-common topics were learned.² $\hat{\sigma}_{\text{common}}$ is the average of the $K_0$ largest $\sigma$ values, and $\hat{\sigma}_{\text{non-common}}$ is the average of the rest. When $\delta^{(0)} = \delta^{(1)}$ in the synthetic data, $\sigma$ in the common portion should be 0.5, whereas it should be 0 in the non-common part. Figure 3 shows that C-LDA better distinguishes between common and non-common topics, especially when $K_0$ is small. This allows non-common topics to be separated from the results by examining the value of $\sigma$. C-HDP has similar performance but larger $\sigma$ values. In ccLDA, all topics are shared between collections, which means that common and non-common topics are mixed. As expected, ccLDA performs similarly when all topics are common across collections.
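The two averages in Eq. 2 are straightforward to compute from the inferred mixing values. The sketch below is an illustration with a hypothetical function name and toy values; it assumes `sigma` holds one inferred value per topic.

```python
def distinguishability(sigma, k0):
    """Return (mean of the k0 largest sigma values, mean of the rest)."""
    ranked = sorted(sigma, reverse=True)
    common, rest = ranked[:k0], ranked[k0:]

    def mean(xs):
        # An empty group (k0 == 0 or k0 == K) has no defined average.
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(common), mean(rest)


# Toy example: two topics look common (sigma near 0.5), two do not.
common_avg, non_common_avg = distinguishability([0.5, 0.02, 0.48, 0.0], k0=2)
```

A large gap between the two returned values corresponds to well-separated common and non-common topics.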


5.2 Semantic Coherence

Semantic coherence is a corpus-based metric of the quality of a topic, defined as the average pairwise similarity of the top $n$ words (Newman et al., 2010a; Mimno et al., 2011). A PMI-based form of coherence, which has been found to be the best proxy for human judgements of topic quality, is defined for a topic $k$ as:

$$C(k) = \frac{2}{n(n-1)} \sum_{i<j} \log \frac{D(w_i, w_j) + 1}{D(w_i)\, D(w_j)}$$

where $D(\cdot)$ computes the document co-occurrence. To accommodate coherence with common topics in C-LDA that have shared and collection-specific components, we define mutual coherence, $MC(k)$, as

$$MC(k) = \frac{1}{n^2} \sum_{\substack{w_i \in \text{shared},\\ w_j \in \text{collection-specific}}} \log \frac{D(w_i, w_j) + 1}{D(w_i)\, D(w_j)}$$

so that for each collection, $C(k)$ ($2n$ words) is equal to $C(k, \text{shared}) + C(k, \text{collection-specific}) + MC(k)$. Table 1 shows the semantic coherence of topics fit with ccLDA and C-LDA. We used a 10% sample of JSTOR due to the limited speed of ccLDA, using 50 (common) topics for ccLDA / C-LDA, and 250 non-common humanities topics for C-LDA. Although these settings are different for the models, the science topics are still comparable because they both have 50 topics. We found that C-LDA provides improved coherence in nearly all situations.

Figure 3: Distinguishability (Eq. 2) of topics fit with C-LDA, C-HDP and ccLDA. Blue lines denote $\hat{\sigma}_{\text{common}}$ and red denote $\hat{\sigma}_{\text{non-common}}$.

² TPYP is not comparable using this metric, but its hierarchical structure will cause topics to mix naturally.

5.2.1 Inference Efficiency 

To compare the efficiency of the models, we timed runs on a sample of 5,036 documents from JSTOR (introduced in the next section) with a 20% hold-out and set $K = K_1 = K_2 = 200$, run on a commodity computer with four cores and 16GB of memory. Figure 4a shows the perplexity over time and iterations. The inference algorithm introduces some staleness, which yields slower convergence in the first 200 iterations. This effect, however, is outweighed in both C-LDA and C-HDP by the increased sampling speed. With 8 threads, C-LDA not only converges faster, but yields lower perplexity, likely due to threads introducing additional stochasticity.

Figure 4: Using JSTOR: perplexity vs. runtime and iterations (a; left) and perplexity vs. K (b; right).

          Coherence                                      Mutual Coherence        Coherence
          shared component   collection-specific                                 shared & collection-specific
          all documents      science    humanities       science    humanities   science    humanities

C-LDA     -8.83              -7.73      -8.04            -8.38      -8.14        -8.54      -8.37
ccLDA     -9.04              -8.22      -8.27            -8.38      -8.15        -8.69      -8.40

C-LDA     -7.22              -3.68      -6.11            -8.25      -8.09        -7.75      -7.97
ccLDA     -8.11              -5.68      -7.12            -8.24      -7.88        -8.22      -7.95

Table 1: Average semantic coherence of the 50 common topics from JSTOR (top) and the average of the 10 best common topics judged by the mean value of different types of coherence (bottom).

5.3 Performance on JSTOR 

To compare our models against slower models, we sampled 2,465 documents from JSTOR, withholding 20% as a testing set. We fit a model with 100 common and 50 non-common initial topics using C-HDP, which produced 272 root topics after 2,000 iterations. The perplexity scores are roughly the same when C-LDA uses the same average number of topics per collection (Figure 4b), except when numbers of topics are very asymmetric. Our model begins to outperform ccLDA after 80 topics. C-HDP did not, however, outperform C-LDA despite the original HDP outperforming LDA. This could be due to the fact that the hierarchical structure of C-HDP is considerably different from the typical 2-level HDP. Held-out perplexity on real data provides a quantitative evaluation of our models' performance in a real-world setting. However, the goal of our models is to enable a deeper analysis of large, weakly-related corpora, which we discuss next.


5.4 Qualitative Analysis 

Our models are designed to enable researchers to compare collections of text in a way that is scalable and sensitive to collection-level asymmetries. To demonstrate that C-LDA can fill this role, we fit a model to the entire JSTOR sciences and humanities collections with 100 science topics and 1000 humanities topics (to reveal the less popular science-related topics in the humanities), and $\beta = 0.01$, $\delta = 1.0$. JSTOR includes books and journal publications in over 9 million documents across nearly 3 thousand journals. We used the journal Science to represent a collection of scientific research and 76 humanist journals to represent humanities research.³ Words were lemmatized, and the most and least frequent words discarded. The final humanities collection contained 149,734 documents and the sciences collection had 160,680 documents, with a combined vocabulary of 21,513 unique words. Together, these collections typify a real-world situation where there is likely some, but not overwhelming, correlation.

The results indicate that the sciences and humanities share several topics. Both exhibit an interest in a “non-human” theme (common topic #2; Table 2). This topic is quite similar in both collections (pig and monkey for science documents; bird and gorilla for humanities documents), while their shared component forms a cohesive topic (animal, specie, and monkey). This kind of correlation is also evident in topic #23, about physics. While the science documents clearly represent research in particle physics, it is interesting to find the topic is also represented by humanist research focused on cultural representations of science. This reflects a growing interest in science and technology studies that has gained recent traction in the humanities. Despite their differences, both collections engage with a similar theme, seen in the shared component with words like particle, energy and atom.

³ The list is available at http://j.mp/humanities-txt.

Topic #2                           Topic #21                              Topic #23
shared   science  humanities       shared       science     humanities    shared       science   humanities

animal   pig      beast            economic     cost        rural         particle     energy    universe
specie   fly      creature         government   industry    local         physic       electron  quantum
dog      monkey   nonhuman         economy      company     community     physicist    ray       physic
wild     guinea   natural          trade        price       village       energy       ion       technical
wolf     primate  humanity         major        market      region        experiment   atom      scientific
monkey   worm     bird             growth       product     urban         event        particle  relativity
horse    dog      living           capital      income      country       measurement  mass      physical
sheep    cat      gorilla          industry     industrial  area          atom         neutron   mechanic
lion     mammal   brute            institution  business    regional      interaction  proton    law
cat      cattle   ape              support      private     population    atomic       nucleus   reality

Table 2: Three topics from the JSTOR collections with their top words in shared and specific components. Complete results available at http://j.mp/jstor-html

The results also indicate that while sciences and humanities documents can share themes, they often diverge in how those themes are discussed. For example, common topic #21 could be identified as economic or capitalist, but in the collection-specific components, the two disciplines differ in their articulation. Science uses terms like price and market, indicating an acceptance of free-market capitalism (especially as it affects the practice of science), while the humanities, which have long been critical of free-market capitalism, use terms like rural and community, highlighting cultural facets of modern economics. These results provide evidence about how ideas move between the sciences and humanities, a phenomenon that constitutes a growing area of research for historians (Galison, 2003; Canales, 2015). C-LDA provides empirical, measurable, and reproducible evidence of the shared research between these disciplines, as well as how concepts are articulated.


6 Discussion 


Our models provide a robust way to explore large and potentially weakly-related text collections without imposing assumptions about the data. Like ccLDA and TPYP, our models account for topic-word variation at the collection level. The models accommodate asymmetry in the numbers of topics (set in C-LDA, fit in C-HDP) and provide an efficient inference method which allows them to fit data with large values for $K$, which can help find correlations in less prevalent topics. Our primary contribution is our models' ability to accommodate asymmetries between arbitrary collections. JSTOR, the world's largest digital collection of humanities research, was an ideal application setting given the size, asymmetry, and comprehensiveness of the humanities collection. As we show, humanities and science research exhibit asymmetries with regard to vocabulary and topic structure, asymmetries that would be systematically overlooked using existing models. By characterizing common topics as mixtures of shared and collection-specific components, we can capture a kind of topic-level homophily, where similar themes are articulated in different ways due to word-, document-, and collection-level variation. Future work on these models could explore methods to fit non-common topics for both collections. In general, C-LDA and C-HDP can be used whenever documents are sampled from ostensibly different populations, where the nature of the difference is unknown.